Abstract. We present a technique for the estimation of photometric redshifts based on feed-forward neural
networks. The Multilayer Perceptron (MLP) Artificial Neural Network is used to predict photometric redshifts 
in the HDF-S from an ultra deep multicolor catalog. Various possible approaches for the training of the neural 
network are explored, including the deepest and most complete spectroscopic redshift catalog currently available 
(the Hubble Deep Field North dataset) and models of the spectral energy distribution of galaxies available in 
the literature. The MLP can be trained on observed data, theoretical data and mixed samples. The prediction 
of the method is tested on the spectroscopic sample in the HDF-S (44 galaxies). Over the entire redshift range, 
0.1 < z < 3.5, the agreement between the photometric and spectroscopic redshifts in the HDF-S is good: the 
training on mixed data produces $\sigma_z^{test} \simeq 0.11$, showing that model libraries together with observed data provide
a sufficiently complete description of the galaxy population. The capability of the neural system is also tested in a low
redshift regime, 0 < z < 0.4, using the Sloan Digital Sky Survey Data Release One (DR1) spectroscopic sample.
The resulting accuracy on 88108 galaxies is $\sigma_z \simeq 0.022$. Inputs other than galaxy colors - such as morphology,
angular size and surface brightness - may be easily incorporated in the neural network technique. An important
feature, in view of the application of the technique to large databases, is the computational speed: in the evaluation
phase, redshifts of $10^5$ galaxies are estimated in a few seconds.

Key words. Galaxies: distances and redshifts - Methods: data analysis - Techniques: photometric, Neural Networks 



1. Introduction 

Deep multicolor surveys, using a selection of broad- 
and/or intermediate-band filters to simultaneously cover 
the spectral energy distribution (SED) of a large num- 
ber of targets, have been an important part of astronomy 
for many years but have remarkably surged in popularity 
in recent times. Digital detectors and telescopes with im- 
proved spatial resolution in all wavelength regimes have 
enabled astronomers to reach limits that were unthink- 
able only a few decades ago and are now revealing ex- 
tremely faint sources (see for a review Cristiani, Renzini 
& Williams 2001). A general hindrance for the transforma- 
tion of this wealth of data into cosmologically useful infor- 
mation is the difficulty in obtaining spectroscopic redshifts 
of faint objects, which, even with the new generation of
8m-class telescopes, is typically limited to I(AB)~25. This 
has spurred a widespread interest in the estimation of the 
redshift directly from the photometry of the targets (pho- 
tometric redshifts). Major spectral features, such as the 
Balmer Break or the Lyman limit, can be identified in 
the observed SED and, together with the overall spectral 
shape, make possible a redshift estimation and a spectral 
classification. 

The photometric redshift techniques described in the 
literature can be classified into two broad categories: the 
so-called empirical training set method, and the fitting of 
the observed Spectral Energy Distributions by synthetic or 
empirical template spectra. In the first approach (see, for
example, Connolly et al. 1995), an empirical relation between
magnitudes and redshifts is derived using a subsample
of objects in which both the redshifts and photometry
are available (the so-called training set). A slightly modified
version of this method was used by Wang et al. 1998
to derive redshifts in the HDF-N by means of a linear
function of colors.

In the SED-fitting approach a spectral library is
used to compute the colors of various types of sources
at any plausible redshift, and a matching technique
is applied to obtain the "best-fitting" redshift. With
different implementations, this method has been used in
the HDF-N (Le Borgne & Rocca-Volmerange 2002;
Massarotti et al. 2001; Sawicki et al. 1997;
Fernández-Soto, Lanzetta & Yahil 1999; Benítez 2000;
Arnouts et al. 1999a) and ground-based data
(Giallongo et al. 2000; Fontana et al. 1999;
Fontana et al. 2000).

A crucial test in all cases is the comparison between the 
photometric and spectroscopic redshifts which is typically 
limited to a subsample of relatively bright objects. 

In the present work, photometric redshifts have been 
obtained using a Multilayer Perceptron Neural Network 
(MLP) with the primary goal of recovering the correct 
redshift distributions up to the highest redshifts in deep 
fields such as the HDFs. The method has been tested 
on the HDF-S spectroscopic sample (0.1<z<3.5) and on 
a sample of galaxies (in a relatively low-redshift regime 
0<z<0.4) from the Sloan Digital Sky Survey Data Release 
One (SDSS DR1, Abazajian et al. 2003).

The structure of this paper is as follows: in Section 2 
we give an introduction to the neural network methods. 
Section 3 describes the training technique and
Section 4 the training set for the HDF-S. In Section 5 we apply
the method to the spectroscopic sample in the HDF-S.
An application to the SDSS DR1 samples is described in
Section 6. Section 7 is dedicated to a general discussion.
Our conclusions are summarized in Section 8. 

2. Artificial Neural Networks 

According to the DARPA Neural Network Study (1988, 
AFCEA International Press), a neural network is a system 
composed of many simple processing elements operating 
in parallel whose function is determined by the network 
structure, connection strengths, and the processing per- 
formed at the computing elements or nodes. 

An artificial neural network has a natural proclivity 
for storing experimental knowledge and making it avail- 
able for use. The knowledge is acquired by the network 
through a learning process and the interneuron connec- 
tion strengths - known as synaptic weights - are used to 
store the knowledge (Haykin 1994).

There are numerous types of neural networks (NNs) 
for addressing many different types of problems, such as 
modelling memory, performing pattern recognition, and 
predicting the evolution of dynamical systems. Most net- 
works therefore perform some kind of data modelling. 

The two main kinds of learning algorithms are
supervised and unsupervised. In the former the correct
results (target values) are known and given to the NN
during the training so that the NN can adjust its weights
to try to match its outputs to the target values. In the
latter, the NN is not provided with the correct results
during training. Unsupervised NNs usually perform some
kind of data compression, such as dimensionality reduction
or clustering.

Fig. 1. A general scheme of a multilayer perceptron feed-forward
neural network.

The two main kinds of network topology are feed-forward
and feed-back. In a feed-forward NN, the connections
between units do not form cycles, and the network usually
produces a relatively quick response to an input. Most
feed-forward NNs can be trained using a wide variety of efficient
conventional numerical methods (e.g. conjugate gradients, 
Levenberg-Marquardt, etc.) in addition to algorithms in- 
vented by NN researchers. In a feed-back or recurrent NN, 
there are cycles in the connections. In some feed-back NNs, 
each time an input is presented, the NN must iterate for 
a potentially long time before producing a response. 

2.1. The Multilayer Perceptron 

In the present work we have used one of the most im- 
portant types of supervised neural networks, the feed- 
forward multilayer perceptron (MLP), in order to produce 
photometric redshifts. The term perceptron is historical, 
and refers to the function performed by the nodes. An
introduction to Neural Networks is provided by Sarle 1994a
and to the multilayer perceptron by Bailer-Jones et al. 2001
and Sarle 1994b. A comprehensive treatment of feed-forward
neural networks is provided by Bishop 1995.

In Fig. 1 the general architecture of a network is shown.
The network is made up of layers and each layer is fully
connected to the following layer. The layers between the
input and the output are called hidden layers and the
corresponding units, hidden units.

For each input pattern, the network produces an out- 
put pattern through the propagation rule, compares the 
actual output with the desired one and computes an error. 
The learning algorithm adjusts the weights of the connections
by an appropriate quantity to reduce the error
(sliding down the slope). This process continues until the 
error produced by the network is low, according to a given 
criterion (see below). 

2.1.1. The propagation rule 

The input of a node ($net_j$) is the combination of the
outputs of the previous nodes ($o_i$) and the weights of
the corresponding links ($w_{ij}$); the combination is linear:
$net_j = \sum_i w_{ij} o_i$. Each unit has a transform function
(or activation function), which provides the output
of the node as a function of the net input. Nonlinear activation
functions are needed to introduce nonlinearity into
the network. We have used the logistic (or sigmoid) function,
out = 1/[1 + exp(-K net)], and the tanh function,
out = tanh(K net), for all units. K is the gain parameter,
fixed before the learning. By increasing K the activation
function approximates a step function. The propagation rule, from
the input layer to the output layer, is a combination of
activation functions.

No significant difference has been found in the training 
process between using the logistic and tanh functions. 
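
The propagation rule of this section can be illustrated with a minimal Python sketch. It is not the code used in this work: bias terms are omitted, and the 2:2:1 network, its weights and the function names are hypothetical choices made only for illustration.

```python
import math

def logistic(net, gain=1.0):
    """Logistic (sigmoid) activation: out = 1 / (1 + exp(-K * net))."""
    return 1.0 / (1.0 + math.exp(-gain * net))

def tanh_unit(net, gain=1.0):
    """Tanh activation: out = tanh(K * net)."""
    return math.tanh(gain * net)

def forward(inputs, layers, gain=1.0, activation=logistic):
    """Propagate an input pattern through fully connected layers.

    `layers` is a list of weight matrices; layers[l][j][i] is the weight
    w_ij of the link from node i of the previous layer to node j.
    Bias terms are left out for brevity (an assumption of this sketch).
    """
    outputs = list(inputs)
    for weights in layers:
        # net_j = sum_i w_ij * o_i, followed by the activation function
        nets = [sum(w * o for w, o in zip(row, outputs)) for row in weights]
        outputs = [activation(net, gain) for net in nets]
    return outputs

# Hypothetical 2:2:1 network with hand-picked weights and gain K = 5.
hidden = [[0.5, -0.5], [1.0, 1.0]]
output = [[1.0, -1.0]]
z = forward([0.2, 0.4], [hidden, output], gain=5.0)
```

Passing `activation=tanh_unit` switches every unit to the tanh function; with all weights set to zero the logistic network returns 0.5 regardless of the input.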

2.1.2. Back-propagation of the error 

The weights, w, are the free parameters of the network 
and the goal is to minimize the total error function with 
respect to w (maintaining a good generalization power, 
see below). 

The error function in the weight space defines the 
multi-dimensional error surface and the objective is to 
find the global (or acceptable local) minima on this sur- 
face. The solution implemented in the present work is the 
gradient descent, within which the weights are adjusted 
(from small initial random values) in order to follow the 
steepest downhill slope. The error surface is not known in 
advance, so it is necessary to explore it in a suitable way. 

The error function typically used is the sum-of-squares
error, which for a single input vector, n, is

$$ e^{(n)} = \frac{1}{2} \sum_i \beta_i \left( y_i^{(n)} - T_i^{(n)} \right)^2 \qquad (1) $$

where $y_i^{(n)}$ is the output of the NN and $T_i^{(n)}$ is the target
output value for the i-th output node, and n runs from 1 to
the total number of examples in the training set. In the
present work i = 1: a single output node is used to estimate
the redshift (other nodes could be used to estimate other
quantities, such as the spectral type). The $\beta_i$ terms make
it possible to assign different weights to different outputs,
and thereby give priority to the correct determination of
certain outputs. In the gradient descent process the weight
vector is adjusted in the negative direction of the gradient
vector,

$$ \Delta w = -\eta \frac{\partial e^{(n)}}{\partial w} \qquad (2) $$

and the new generic weight is

$$ w_{new} = w_{old} + \Delta w $$

The amplitude of the step on the error surface is set
by the learning parameter $\eta$: large values of $\eta$ mean large
steps. Typically $\eta$ belongs to the interval (0,1]; in this
application a small value has been used (< 0.005) together
with a high value of the gain in the activation functions
(K = 5). If $\eta$ is too small the training time becomes very
long, while a large value can produce oscillations around
a minimum or even cause the optimal minimum in
the error surface to be missed.

The learning algorithm used in the present work is
the standard back-propagation. It refers to the method
for computing the gradient of the case-wise error function
with respect to the weights for a feed-forward network.
"Standard backprop" is another name for the generalized
delta rule, the training algorithm that was popularized
by Rumelhart, Hinton, and Williams in chapter 8 of
Rumelhart and McClelland (1986), which remains one of
the most widely used supervised training methods for neural
nets.

This learning algorithm requires the error function
to be continuous and differentiable, so that it is possible
to calculate the gradient. For this reason the activation
functions (and their final combination through the propagation
rule) must be continuous and differentiable. From the
computational point of view, the derivative of the activation
functions adopted in the present work is easily related
to the value of the function out = F(net) itself (see
Sec. 2.1.1): $F' \propto out(1 - out)$ in the case F = sigmoid, or
$F' \propto (1 - out^2)$ if F = tanh.
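
The two derivative relations can be checked numerically. The short Python sketch below is only an illustration: with the gain K of Sec. 2.1.1 the proportionality constant is K itself, and the closed forms are compared with a central finite difference. The function names and the step `h` are choices of this sketch.

```python
import math

def logistic(net, gain):
    return 1.0 / (1.0 + math.exp(-gain * net))

def tanh_unit(net, gain):
    return math.tanh(gain * net)

def logistic_slope(out, gain):
    """dF/dnet of the logistic unit written in terms of its own output."""
    return gain * out * (1.0 - out)

def tanh_slope(out, gain):
    """dF/dnet of the tanh unit written in terms of its own output."""
    return gain * (1.0 - out * out)

def numerical_slope(f, net, gain, h=1e-6):
    """Central finite-difference estimate of dF/dnet."""
    return (f(net + h, gain) - f(net - h, gain)) / (2.0 * h)

K = 5.0
max_diff = 0.0
for net in (-0.3, 0.0, 0.2):
    d_log = abs(logistic_slope(logistic(net, K), K) - numerical_slope(logistic, net, K))
    d_tanh = abs(tanh_slope(tanh_unit(net, K), K) - numerical_slope(tanh_unit, net, K))
    max_diff = max(max_diff, d_log, d_tanh)
```

Because the slope follows from the output alone, the backward pass can reuse the values already computed in the forward pass.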

When the network weights approach a minimum solution,
the gradient becomes small and the step size diminishes
too, leading to a very slow convergence. Adding
a momentum term (a residual of the previous weight variation)
to the weight-update equation improves the
minimization (Bishop 1995):

$$ w_{new} = w_{old} + \Delta w + \alpha \Delta w_{old} \qquad (3) $$

where $\alpha$ is the momentum factor (set to 0.9 in our applications).
This can reduce the decay in learning updates and
cause the learning to proceed through the weight space in
a fairly constant direction. Besides a faster convergence
to the minimum, this method makes it possible to escape
from a local minimum if there is enough momentum to
travel through it and over the following hill (see Fig. 2).
The generalized delta rule including the momentum is
called the "heavy ball method" in the numerical analysis
literature (Bertsekas 1995, pp. 78-79).

The learning algorithm has been used in the so-called
on-line (or incremental) version, in which the weights of
the connections are updated after each example is processed
by the network. One epoch corresponds to the processing
of all examples one time. The other possibility is
to compute the training in the so-called batch learning (or
epoch learning), in which the weights are updated only at
the end of each epoch (not used in the present application).

Fig. 2. A simplified representation of the error surface:
the behavior of the error as a function of 2 weights. The
momentum term improves the minimization during the
training phase. Momentum allows a network to respond
to the local gradient and also to take into account the
recent trends in the error surface. Acting like a low-pass
filter, momentum allows the network to ignore small features
in the error surface. Without momentum a network
may get stuck in a shallow local minimum. With momentum
a network can slide through such a minimum.
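
As an illustration of on-line back-propagation with momentum, a minimal Python sketch for a 1:3:1 network follows. It is not the implementation used in this work: the bias weights, the toy linear target, the network size and all names are assumptions of the sketch, while the values of $\eta$, $\alpha$ and K are those quoted in the text.

```python
import math
import random

def logistic(net, gain):
    return 1.0 / (1.0 + math.exp(-gain * net))

def train_online(examples, n_hidden=3, eta=0.005, alpha=0.9, gain=5.0,
                 epochs=500, seed=0):
    """On-line back-propagation with momentum for a 1:n_hidden:1 network.

    Every unit has a bias weight (an assumption of this sketch).
    Returns the mean sum-of-squares error of each epoch.
    """
    rng = random.Random(seed)
    # Small random initial weights; each hidden row is [weight, bias].
    w_hid = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]  # last = bias
    dw_hid = [[0.0, 0.0] for _ in range(n_hidden)]  # previous weight variations
    dw_out = [0.0] * (n_hidden + 1)
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in examples:
            # Forward pass (propagation rule).
            hid = [logistic(w[0] * x + w[1], gain) for w in w_hid]
            net_out = sum(w * h for w, h in zip(w_out, hid)) + w_out[-1]
            y = logistic(net_out, gain)
            total += 0.5 * (y - target) ** 2
            # Backward pass: derivative of the error with respect to each net input.
            delta_out = (y - target) * gain * y * (1.0 - y)
            delta_hid = [delta_out * w_out[j] * gain * hid[j] * (1.0 - hid[j])
                         for j in range(n_hidden)]
            # Weights are updated after every example (on-line learning):
            # new variation = -eta * gradient + alpha * previous variation.
            for j in range(n_hidden):
                dw_out[j] = -eta * delta_out * hid[j] + alpha * dw_out[j]
                w_out[j] += dw_out[j]
            dw_out[-1] = -eta * delta_out + alpha * dw_out[-1]
            w_out[-1] += dw_out[-1]
            for j in range(n_hidden):
                for k, inp in enumerate((x, 1.0)):
                    dw_hid[j][k] = -eta * delta_hid[j] * inp + alpha * dw_hid[j][k]
                    w_hid[j][k] += dw_hid[j][k]
        history.append(total / len(examples))
    return history

# Toy training set: a linear target on [0, 1], purely illustrative.
examples = [(i / 19.0, 0.2 + 0.6 * i / 19.0) for i in range(20)]
history = train_online(examples)
```

A batch version would accumulate the gradients over all examples and apply a single update at the end of each epoch.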



3. The training technique 

During the learning process, the output of a supervised
neural net comes to approximate the target values given
the inputs in the training set. This ability may be useful in
itself, but more often the purpose of using a neural net is to
generalize, i.e. to get some output from inputs that are not
in the training set (generalization). NNs, like other flexible
nonlinear estimation methods such as kernel regression
and smoothing splines, can suffer from either under-fitting
or over-fitting. A network that is not sufficiently complex^1
can fail to fully detect the signal in a complicated
data set, leading to under-fitting: an inflexible model will
have a large bias. On the other hand, a network that is
too complex may fit the noise, not just the signal, leading
to over-fitting: a model that is too flexible in relation
to the particular data set will produce a large variance
(Sarle 1995). The best generalization is obtained when the
best compromise between these two conflicting quantities
(bias and variance) is reached. There are several approaches
to avoid under- and over-fitting, and obtain a
good generalization. Some of them aim to regularize the
complexity of the network during the training phase, such
as the early stopping and weight-decay methods (in the latter,
the size of the weights is tuned in order to produce a mapping
function with small curvature, and large weights are
penalized; reducing the size of the weights also reduces
the "effective" number of weights, Moody et al. 1992).

^1 The complexity of a network is related to both the number
of weights and the amplitude of the weights (the mapping
produced by a NN is an interpolation of the training data; a
high order fit to data is characterized by large curvature of the
mapping function, which in turn corresponds to large weights).

A complementary technique belongs to the Bayesian
framework, in which the bias-variance trade-off is not so
relevant, and networks with high complexity can be used
without producing over-fitting (an example is to train a
committee of networks, Bishop 1995).

3.1. Generalization error

3.1.1. Early stopping 

The most commonly used method for estimating the gen- 
eralization error in neural networks is to reserve part of 
the data as a test set, which must not be used in any way 
during the training. After the training, the network is ap- 
plied to the test set, and the error on the test set provides 
an unbiased estimate of the generalization error, provided 
that the test set was chosen in a random way. 

In order to avoid (possible) over-fitting during the 
training, another part of the data can be reserved as a 
validation set (independent both of the training and test 
sets, not used for updating the weights), and used during 
the training to monitor the generalization error. The best 
epoch corresponds to the lowest validation error, and the 
training is stopped when the validation error rate "starts 
to go up" (early stopping method). The disadvantage of 
this technique is that it reduces the amount of data avail- 
able for both training and validation, which is particularly 
undesirable if the available data set is small. Moreover, 
neither the training nor the validation make use of the 
entire sample. 
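
The early stopping logic can be sketched in a few lines of Python. This is a hedged illustration, not the procedure adopted in this work: the `patience` criterion is one possible way of deciding that the validation error "starts to go up", and the callbacks, the toy validation curve and all names are hypothetical.

```python
def early_stopping(train_one_epoch, validation_error, max_epochs, patience=20):
    """Keep the weights of the epoch with the lowest validation error.

    `train_one_epoch()` runs one epoch and returns a snapshot of the weights;
    `validation_error(weights)` evaluates them on the validation set.
    Training stops once the error has not improved for `patience` epochs
    (an assumed stopping criterion).
    """
    best_err, best_epoch, best_weights = float("inf"), -1, None
    for epoch in range(max_epochs):
        weights = train_one_epoch()
        err = validation_error(weights)
        if err < best_err:
            best_err, best_epoch, best_weights = err, epoch, weights
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err, best_weights

# Toy usage: a synthetic validation curve with its minimum at epoch 30.
curve = [0.1 + (epoch - 30) ** 2 / 900.0 for epoch in range(100)]
state = {"epoch": -1}

def train_one_epoch():
    state["epoch"] += 1
    return state["epoch"]  # the "weights" are just the epoch index here

best_epoch, best_err, best_weights = early_stopping(
    train_one_epoch, lambda w: curve[w], max_epochs=100)
```

In this toy run the loop stops at epoch 50 and returns the snapshot taken at epoch 30, the minimum of the validation curve.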

3.1.2. Committees of networks 

As mentioned in the previous sections, an over-trained NN
tends to produce a large variance in the predictions while
maintaining a relatively small bias. A method that reduces the
variance (and keeps the bias small) is to use a committee
of NNs (Bishop 1995). Each member of the committee
differs from the other members in its training
history. We have generated the members using a bootstrap
process, varying:

1. the sequence of the input patterns (the incremental 
learning method used in the present work is dependent 
on the sequence presented). 

2. the initial distribution of weights (the starting point 
on the error surface). 

3. the architecture of the NN (number of nodes and lay- 
ers). 

The final prediction adopted in the present work is
the mean and the median of the predictions obtained from
the members of the committee (with the 1-$\sigma$ error or the 16 and 84
percentiles). Averaging over many solutions reduces
the variance. Since the complexity of the individual
member is not a problem, the trainings have been performed
without regularization, and the weights at the lowest
training error have been frozen and used for the prediction.

This method has displayed a better and more stable generalization
power than a single training (even when the
validation set is used to regularize the learning). Moreover,
this method gives a robust estimate of the error bounds
for the output of the network.

For these reasons the training described in the next 
sections has been carried out using a committee of net- 
works. 
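
The way a committee combines its members can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the code used here: "bootstrap" is interpreted as resampling the training set with replacement, `train_member` is a hypothetical callback that trains one network and returns a predictor, and `toy_member` merely stands in for a real training.

```python
import random
import statistics

def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def committee_predict(train_member, training_set, inputs, n_members=100, seed=0):
    """Train n_members networks and summarize their predictions for `inputs`."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_members):
        # Resampling with replacement (assumed meaning of "bootstrap").
        sample = [rng.choice(training_set) for _ in training_set]
        rng.shuffle(sample)                 # vary the sequence of input patterns
        model = train_member(sample, rng)   # rng also drives the initial weights
        predictions.append(model(inputs))
    predictions.sort()
    return {
        "mean": statistics.mean(predictions),
        "median": statistics.median(predictions),
        "p16": percentile(predictions, 16),
        "p84": percentile(predictions, 84),
    }

def toy_member(sample, rng):
    """Stand-in for a trained network: mean target plus a random offset."""
    offset = rng.gauss(0.0, 0.05)
    mean_z = sum(z for _, z in sample) / len(sample)
    return lambda inputs: mean_z + offset

training_set = [((0.1 * k,), 0.5 + 0.01 * k) for k in range(50)]
result = committee_predict(toy_member, training_set, inputs=(1.0,))
```

The interval between `p16` and `p84` plays the role of the error bar attached to each prediction.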



4. The training-set 

Since we are using a supervised neural network, we need a 
training-set. Each element (example) in the training-set is 
composed of a pair of vectors: the input pattern and the 
target. For our purposes the input pattern contains the 
Spectral Energy Distribution (SED) of the objects (but 
other configurations are possible: templates with a priori 
knowledge, SED plus the apparent luminosity in a refer- 
ence band, the angular size, the morphology, etc.). The 
target in this application is the redshift. 

The training has been tested on the available spectroscopic
sample in the HDF-S (Cristiani et al. 2000;
Rigopoulou et al. 2000; Vanzella et al. 2002; Glazebrook
et al., http://www.aao.gov.au/hdfs/Redshifts/). The
predictions of the redshifts in the HDF-S have been computed
following different approaches:

1. training on the HDF-N spectroscopic sample using the 
colors as an input pattern. 

2. training on the HDF-N spectroscopic sample using the 
colors and the apparent luminosity in the I band as an 
input pattern. 

3. training on both the HDF-N spectroscopic sample and a
set of templates obtained from CWW (Coleman, Wu
& Weedman) and/or from Rocca-Volmerange and Fioc
(labelled RV00 hereafter).

4. training on the CWW or RV00 SEDs alone (without
spectroscopic redshifts).

The photometry of the HDF-N has been obtained
from the available catalog provided by
Fernández-Soto, Lanzetta & Yahil 1999, whereas the
photometric catalog of the HDF-S is provided by
Vanzella et al. 2001 and Fontana et al. 2003.

The sample in the HDF-N contains 150 spectroscopic
redshifts (Cohen et al. 2000; Dawson et al. 2001;
Fernández-Soto et al. 2001), while the sample in the
HDF-S contains 44 spectroscopic redshifts (in Fig. 3 the
redshift distributions of both fields are shown).

Fig. 3. Spectroscopic redshift distributions of the two
fields HDF-N (dashed line) and HDF-S (solid line).

In order to test the prediction we have used the variance
as a statistical estimator:

$$ \sigma_z^2 = \frac{1}{N} \sum_{i=1}^{N} \left( z_{NN,i} - z_{spec,i} \right)^2 \qquad (4) $$

where $z_{NN}$ is the neural prediction, N is the number of
galaxies, and i = 1..N. In the literature another statistical
estimator is sometimes used, the mean absolute deviation
normalized by the (1 + z) factor (e.g. Labbé et al. 2003):

$$ \delta z = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| z_{NN,i} - z_{spec,i} \right|}{1 + z_{spec,i}} \qquad (5) $$

The quantity $\delta z$ has the advantage of being roughly uniform
with redshift, while the variance tends to increase with increasing
redshift.
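
For concreteness, the two estimators can be computed as in the following Python sketch. The function names are ours, the normalization by N (rather than N - 1) is an assumption, and the three redshift pairs are made-up numbers used only to exercise the functions.

```python
import math

def sigma_z(z_nn, z_spec):
    """Square root of the variance estimator (rms of z_NN - z_spec)."""
    n = len(z_spec)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_nn, z_spec)) / n)

def delta_z(z_nn, z_spec):
    """Mean absolute deviation normalized by (1 + z_spec)."""
    n = len(z_spec)
    return sum(abs(a - b) / (1.0 + b) for a, b in zip(z_nn, z_spec)) / n

# Made-up example values, not taken from any catalog.
z_nn = [0.5, 1.0, 3.0]
z_spec = [0.4, 1.2, 3.0]
s = sigma_z(z_nn, z_spec)   # about 0.129
d = delta_z(z_nn, z_spec)   # about 0.054
```

The normalization by (1 + z) damps the contribution of high-redshift objects, which is why the second estimator varies less with redshift.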



4.1. The input pattern 

The magnitudes of the observed objects in a given photometric
system are the input of the network. In the present
work the filters are F300, F450, F606, F814 (WFPC2,
HST) and Js, H, Ks for the near infrared (ISAAC, VLT).
If the flux in a given band has a signal-to-noise ratio less
than 2.0 it is considered an upper limit in that band, and
the value of the flux is set to the 1$\sigma$ error.

It is convenient to avoid too large input values that 
could cause a saturation in the output of the activation 
functions (sigmoid or tanh), but it is not necessary to 
rescale the inputs rigorously in the interval [-1,1]. A
nonlinear rescaling of the input values is also useful to make
the function that the network is trying to
approximate more uniform.
Fig. 4. Comparison between spectroscopic redshift in the
HDF-S and the neural redshift using the colors as an input
pattern. The training has been done on the HDF-N
spectroscopic sample, the estimation of the redshift for
each object is the median of 100 predictions and the error
bars represent the 1-$\sigma$ interval. Open circles represent objects
with unreliable photometry and triangles are objects with
uncertain spectroscopic redshift.

Fig. 5. The effects of adding information. Upper panel:
comparison between spectroscopic redshift and the neural
redshift for the spectroscopic sample in the HDF-S.
The training has been carried out on the HDF-N spectroscopic
sample (150 objects, as shown in the lower panel
of Fig. 1). The partial error ($\sigma_{part}$) has been considered,
i.e. the dispersion calculated without the three
objects marked with the open square symbols (see text).
Lower panel: comparison between spectroscopic redshift
and the neural redshift for the spectroscopic sample in
the HDF-S; the open square symbols show the three objects
that have been used during the training (in addition
to the 150 objects in the HDF-N). This new information
improves the partial error (i.e. the $\sigma_{part}$ calculated without
these three objects), in particular at redshift around
1.2.

In the present application the input values have been
rescaled: $p_i = -0.5 + [f_i / f_{F814}]^{0.4}$, where i runs over the
following bands: F300, F450, F606, Js, H, Ks and $f_{F814}$
is the flux in the reference F814 band. When the apparent
AB magnitude in the F814 band, $m_{814}$, is used as an input
(e.g. Sec. 5.1.2 and 5.2.2), it has been normalized as follows:

$$ p_{F814} = ([m_{814} - m_{min}] - [m_{max} - m_{814}]) $$

where $m_{max}$ is 28 and $m_{min}$ is 18.
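
The preparation of an input pattern can be illustrated with a short Python sketch covering only the two steps that are unambiguous in the text: the upper-limit rule for bands with signal-to-noise ratio below 2 and the flux ratios relative to F814. The dictionary layout, the function names and the flux values are hypothetical.

```python
# Bands whose flux is divided by the F814 reference flux.
BANDS = ("F300", "F450", "F606", "Js", "H", "Ks")

def apply_upper_limits(flux, error, min_snr=2.0):
    """Replace every flux with S/N < min_snr by its 1-sigma error."""
    return {band: (f if f / error[band] >= min_snr else error[band])
            for band, f in flux.items()}

def flux_ratios(flux, reference="F814"):
    """Flux ratios f_i / f_F814 for the six non-reference bands."""
    return [flux[band] / flux[reference] for band in BANDS]

# Made-up fluxes (arbitrary units) with a uniform 1-sigma error of 0.1.
flux = {"F300": 0.05, "F450": 1.0, "F606": 1.5, "F814": 2.0,
        "Js": 2.5, "H": 3.0, "Ks": 4.0}
error = {band: 0.1 for band in flux}

limited = apply_upper_limits(flux, error)
ratios = flux_ratios(limited)
```

In this example only F300 falls below the threshold, so its flux is replaced by the 1$\sigma$ error before the ratios are formed.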

5. Redshift prediction on the HDF-S 

5.1. Training on the HDF-N

5.1.1. Colors as input pattern

The input pattern contains the colors of the galaxies
$(f_{F300}/f_{F814}, f_{F450}/f_{F814}, f_{F606}/f_{F814}, f_{J}/f_{F814}, f_{H}/f_{F814}, f_{K}/f_{F814})$,
normalized as described in Section 4.1.

The training has been carried out setting the maximum
number of epochs to 5000. The distribution of
weights corresponding to the minimum training error
has been stored. We have verified that 5000 epochs
are sufficient in this case to reach the convergence
of the system. Trainings done on 10000 and 15000
epochs give similar results.

The dispersion $\sigma_z^{test}$ obtained for the spectroscopic
sample in the HDF-S is shown in Table 1 (left side).
Different architectures have been used with one and
two hidden layers and different numbers of nodes.
The comparison between $z_{spec}$ and $z_{NN}$ for the
architecture 6:10:5:1 (six input nodes, two hidden
layers with ten and five units, and one output node)
is shown in Fig. 4. The resulting error is $\sigma_z^{test} = 0.172$.
The systematic errors are common to all the explored
architectures. In particular there is a clear discrepancy
for the object at z = 0.173 (ID=667 in the
Tables of Vanzella et al. 2001), due to the insufficient
information available in that redshift regime. A
systematic underestimation for the group of objects at
redshift around 1.2 is also evident. Combining different
architectures with different numbers of units
in the second hidden layer (from 1 to 12), the result
does not change: the dispersion in the test set
is compatible with the dispersion obtained using a
fixed architecture.

Table 1. Training of different architectures on the HDF-N spectroscopic sample (150 objects) and evaluation on the
HDF-S spectroscopic sample. The number of epochs is 5000, the bootstrap has been computed on 100 extractions
(100 members of the committee).

architecture       <sigma_train>   sigma_z^test (median/mean)   median/mean
[6:10:10:1]_191    0.100           ...                          0.074/0.078
[7:10:10:1]_201    0.090           0.163/0.171                  0.065/0.065
[6:10:9:1]_179     0.103           0.191/0.191                  0.074/0.075
[7:10:9:1]_189     0.087           0.174/0.173                  0.067/0.065
[6:10:8:1]_167     0.103           0.193/0.203                  0.074/0.079
[7:10:8:1]_177     0.083           0.166/0.172                  0.066/0.067
[6:10:7:1]_155     0.107           0.192/0.195                  0.074/0.075
[7:10:7:1]_165     ...             ...                          ...

For networks with a low complexity the error
($\sigma_z^{test}$) starts increasing together with $\langle\sigma_{train}\rangle$
(the mean of the training errors $\sigma_{train}$
obtained in the bootstrap). The same
happens with networks with one hidden layer (see
Table 1).

These results show that, although one hundred
extractions (100 members) are enough to diminish
the random errors, new information in the training
set is needed in order to reduce the systematic errors.

This is clearly shown in Fig. 5, where we have
added to the training set three objects belonging to
the HDF-S spectroscopic sample: ID=667, with the
discrepant redshift mentioned above, and two objects
randomly chosen from the group around redshift 1.2.
In the upper panel of Fig. 5 the square symbols represent
these three objects used in the training together
with the 150 in the north; the dispersion in
the HDF-S is calculated on the rest of the sample
(41 objects, $\sigma_{part}$). The training on the 150 objects
gives as prediction $\sigma_{part} = 0.145$. By computing the
training in the same conditions but with 153 objects
rather than 150, the prediction around redshift 1.2
clearly improves, and $\sigma_{part} = 0.093$ is obtained. The
predictions for the rest of the objects do not change
significantly. The improvement for the square symbols
is obvious (it is due to the learning algorithm).
The network shows a remarkable ability to learn the
new signal present in the training set.

In the next section the colors together with the 
apparent luminosity in the F814 band will be used 
as input pattern. 

5.1.2. Colors and apparent luminosity as an input 
pattern 

The input pattern contains the colors and the apparent
luminosity in the F814 band. Also in this case
we have performed one hundred trainings on the 150
galaxies in the HDF-N. The dispersion $\sigma_z^{test}$ obtained
for the spectroscopic sample in the HDF-S is shown
in Table 1 (right side, "colors & mag."). In general,
the predictions are better than the results obtained
with only colors as an input pattern. In this application
the magnitude information improves the prediction
at low redshift (in particular for the object
ID=667). On the other hand the scatter at high redshift
seems to increase, if compared with the case
with only colors as an input (see Fig. 6). There is
still a bias (although reduced) at redshifts around
1.2.

Fig. 6. Comparison between spectroscopic redshift and
the neural redshift for the spectroscopic sample in the
HDF-S. The training has been carried out on the HDF-N
spectroscopic sample, the estimation of the redshift for
each object is the median of 100 predictions. The input
pattern is composed of colors and the apparent luminosity
in the F814 band. The symbols are the same as in
Fig. 4.

Fig. 7. Comparison between spectroscopic redshift in the
HDF-S and the neural redshift obtained with a committee
of networks and using the colors as input pattern. The
estimation of the redshift for each object is the mean of
100 predictions and the error bars represent the 1-$\sigma$ interval.
The training set is composed of CWWK SEDs mixed with
the spectroscopic sample in the HDF-N. The symbols are
the same as in Fig. 4.

The training on different architectures, 6:10:1:1,
6:10:2:1, ... and 6:10:12:1 (6:10:1..12:1 hereafter) produces
a dispersion similar to that obtained by fixing
the architecture. Also in this case the networks with
a low complexity produce a large error both in the
$\langle\sigma_{train}\rangle$ and in the $\sigma_z^{test}$. The same happens with
networks with one hidden layer (see Table 1).

These tests show that the information introduced 
by the apparent luminosity produces a slight im- 
provement: the error is always less than the error 
obtained using only colors (but the sample is still 
too small to generalize this result). 

The problem concerning the completeness of the
training set is common in the empirical technique for
the estimation of the redshift. There is a well-known
gap without spectroscopic redshifts in the interval
(1.3,2) due to the absence of observational spectroscopic
features. Moreover, spectroscopic surveys are
flux limited and the spectroscopic redshifts tend to
be available only for brighter objects. To solve this
problem and fill the above mentioned gap it is useful
to introduce in the training set examples derived
from observed or synthetic template SEDs.

5.2. Combination of training sets 

5.2.1. Training on HDF-N mixed with CWW SEDs 

Increasing the information in the training data is an 
obvious method to improve the generalization. 

As a first approach to produce a complete range 
of galaxy SEDs we have adopted the templates of 
Coleman, Wu & Weedman (1980) for a typical ellip- 
tical, Sbc, Scd and Irregular galaxy plus two spectra 
of star-forming galaxies (SB1 and SB2 from the atlas 
of Kinney et al. 1996). This choice is similar to the
approach of Fernández-Soto, Lanzetta & Yahil (1999)
and Arnouts et al. (1999a) and in the following will be
referred to as "CWWK".

Table 2. Training of different architectures on the HDF-N spectroscopic sample and a set of templates derived from
CWWK. The evaluation is on the HDF-S spectroscopic sample. The bootstrap has been computed on 100 extractions
(100 members of the committee). In the "training data" column, "+150" means that the 150 spectroscopic redshifts
in the HDF-N have been used in addition to the CWWK SEDs.
* Training on different architectures: in the second hidden layer the number of units ranges from 1 to 12 (6:10:1..12:1).

Galaxies have been simulated in the redshift
range 0 < z < 6. 3206 SEDs have been drawn from the
CWWK templates with a step in redshift equal to
0.01 (dz = 0.01). Extinction effects have been
introduced (E(B - V) = 0.05, 0.1, 0.2) adopting a
Calzetti extinction law (Calzetti 1997). 12824 SEDs
have been produced in this way. The CWWK templates
do not take into account the evolution of
galaxy SEDs with cosmic time.

A committee of 100 networks has been adopted 
and the median and mean values have been used to 
estimate the redshift. 
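The committee estimate described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `committee_estimate`, the synthetic predictions and the use of the sample standard deviation as the 1-σ error bar are all assumptions made for the example.

```python
import numpy as np

def committee_estimate(predictions):
    """Combine the outputs of a committee of networks.

    predictions: array of shape (n_members, n_galaxies), one row per
    trained network (hypothetical layout chosen for this sketch).
    """
    predictions = np.asarray(predictions, dtype=float)
    return {
        "median": np.median(predictions, axis=0),
        "mean": predictions.mean(axis=0),
        # Spread among committee members, used here as a 1-sigma error bar
        # (an assumption; the text only says that error bars are 1-sigma).
        "sigma": predictions.std(axis=0, ddof=1),
    }

# Synthetic stand-in for the 100 network outputs of three galaxies.
rng = np.random.default_rng(0)
z_true = np.array([0.173, 1.2, 2.8])
fake_predictions = z_true + rng.normal(0.0, 0.1, size=(100, 3))
summary = committee_estimate(fake_predictions)
```

Using the median rather than the mean makes the estimate less sensitive to a few badly converged members of the committee.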

In Table 2 the prediction for the HDF-S spectroscopic
sample is shown. A series of tests has been carried out
both neglecting the effects of intrinsic extinction
and introducing an extinction effect. No significant
difference in the predictions has been measured.
The number of training data and the <σ_z^train> are
also shown.

The predictions for the HDF-S are clearly improved
by taking into account the information derived
from the CWWK templates and remain stable almost
independently of the architecture (σ_z^test ~ 0.13). Low
complexity networks (6:7:6:1 and 6:5:5:1) produce
large errors: these are clear cases of under-fitting of
the training data. In Fig. 7 the comparison between
the spectroscopic redshifts and the neural predictions
is shown for the network 6:15:15:1 and bootstrap process.
The prediction improves at redshift around 1
and for the object ID=667 at z = 0.173. At high
redshift (z > 2) the uncertainty of the individual
redshift estimates is significantly reduced (compare,
for example, the error bars at z > 2 in Fig. |1] and in 

Fig. ED. 

Increasing the step in redshift (dz = 0.01, 0.05, 0.1),
and hence reducing the number of training data, leaves the
prediction stable. The trainings computed on a reduced
sample, 326+150 examples (326 CWWK SEDs
and 150 spectroscopic redshifts in the HDF-N) with
dz = 0.1 and 646+150 examples with dz = 0.05 without
extinction, give the same result obtained with
dz = 0.01. This means that the committee of networks
is able to achieve the same fit in the color space
also with a coarser grid.

5.2.2. Training on the HDF-N mixed with Pegase 
models 

We have also trained the neural system on
the HDF-N spectroscopic sample and a set of
models derived from the most recent version
of the code by Fioc & Rocca-Volmerange
(Fioc & Rocca-Volmerange 1997), named Pegase 2.0
(RV00).

In the Rocca-Volmerange code the star formation
history is parameterized by two e-folding star formation
time-scales, one (τ_g) describing the time-scale
for the gas infall on the galaxy and the other (τ_*) the
efficiency of gas to star conversion. By tuning the
two time-scales it is possible to reproduce a wide
range of spectral templates, from early types (by using
small values of τ_g and τ_*) to late types. For the
earliest spectral type, a stellar wind is also assumed
to block any star-formation activity at an age t_wind.
The major advantage of the Rocca-Volmerange code is
that it allows one to follow explicitly the metallicity
evolution, including also a self-consistent treatment of
dust extinction and nebular emission. Dust content
is followed over the galaxy history as a function of
the on-going star-formation rate, and an appropriate
average over possible orientations is computed.
Although more model-dependent, this approach has
the advantage of producing the evolutionary tracks
of several galaxy types with a self-consistent treatment
of the non-stellar components (dust and nebular
emission). An application of the PEGASE 2.0
code to photometric redshifts has been recently presented
by Le Borgne & Rocca-Volmerange (2002).

Fig. 9. Predictions of a (6:20:20:1) NN for 44 galaxies in the HDF-S as a function of the epoch (an epoch corresponds
to the processing of all the examples one time, as defined in Sect. 2.1.2). The training has been carried out on the
spectroscopic sample in the HDF-N and on RV00 templates, using as an input pattern the colors and the I mag. The
ordinate shows the difference between the prediction of the NN, z_NN, at a given epoch and the actual spectroscopic
redshift z_spec. The numbers in the upper left part of the panels correspond to the galaxy identifiers in the catalog by
Vanzella et al. (2001).



We have followed the training technique described
in Sec. 3.1.2. Adopting the scenarios described in
Le Borgne & Rocca-Volmerange (2002) we have obtained
three samples from the RV00 package: 112824,
28544 and 14400 models with step in redshift dz =
0.025, dz = 0.1 and dz = 0.2, respectively (0 < z < 6).
Another training sample has been obtained from the
112824 sample by dimming the fluxes by a factor of 10
and 100 and considering as the training set the templates
with apparent luminosity in the F814 band
less than 27; in this way 201757 objects have been
obtained.

Fig. 8. Comparison between spectroscopic redshift in the
HDF-S and the neural redshift obtained with a committee
of networks and using as input pattern the colors. The
estimation of the redshift for each object is the mean of
100 predictions and the error bars represent the 1-σ interval.
The training set is composed of RV00 models (bootstrap
on 1000 RV00 SEDs and 150 spectroscopic redshifts in
the HDF-N, see Table 3). The symbols are the same as in
Fig. H

In the training on mixed samples the RV00
templates produce slightly better results than the
CWWK SEDs. A bootstrap process of 100 extractions
has been carried out: at each extraction a random
sequence of the input patterns and a random
initialization of the weights have been adopted. At
each extraction the training has been computed on a
set of data composed of 150 spectroscopic redshifts
in the HDF-N and a subset of models extracted randomly
from the RV00 samples. The performance in
the HDF-S sample is σ_z^test ~ 0.12 (see Table 3).
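One extraction of the bootstrap process can be sketched as below. The array layout (six colors plus the redshift per row), the helper name `draw_training_set` and the choice of sampling the models without replacement are assumptions for illustration; the text only states that a random subset of models is drawn and that the sequence of patterns is randomized.

```python
import numpy as np

def draw_training_set(spec_sample, model_library, n_models, rng):
    """Build the training set for one committee member.

    spec_sample:   (n_spec, n_columns) observed galaxies with redshifts
    model_library: (n_lib, n_columns) template-derived examples
    """
    # Random subset of the model library (without replacement: assumption).
    chosen = rng.choice(len(model_library), size=n_models, replace=False)
    combined = np.concatenate([spec_sample, model_library[chosen]], axis=0)
    # Random sequence of the input patterns for this extraction.
    return rng.permutation(combined)

rng = np.random.default_rng(1)
spec = rng.normal(size=(150, 7))        # stand-in for the 150 HDF-N objects
library = rng.normal(size=(14400, 7))   # stand-in for a model sample
training_sets = [draw_training_set(spec, library, 1000, rng) for _ in range(3)]
```

Each committee member would then be trained on its own `training_sets[i]`, starting from an independent random initialization of the weights.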

Fig. 9 shows that no significant trend is present
over the epochs when varying the initial distribution of
weights and the sequence of the training data (the
abscissa shows the epochs and the ordinate the difference
z_NN - z_spec). The prediction of the network
becomes stable after the first epochs (beyond about
500) and remains so until the maximum epoch (20000). The spread
in the plots gives an indication of the resulting uncertainty
(the spread is also stable over the epochs).



Fig. 10. Comparison between spectroscopic redshift in the 
HDF-S and the neural redshift obtained with a committee 
of networks and using as input pattern the colors. The 
estimation of the redshift for each object is the mean of 100 
predictions and the error bars represent the 1-σ interval. The
training set is composed of CWWK SEDs (3206 SEDs, see 
Table |2J ■ The symbols are the same as in Fig. 

Adopting a training set composed of RV00,
CWWK and the spectroscopic sample in the HDF-N
produces a σ_z^test ~ 0.12, of the same order as
the dispersions obtained with RV00+HDF-N and
CWWK+HDF-N as training sets.

5.2.3. Training on CWWK or RV00 templates

Table 4 summarizes the results of various trainings
carried out only on templates, without the spectroscopic
redshifts.

Training on the colors derived from the CWWK
templates produces a dispersion in the HDF-S sample
σ_z^test = 0.186/0.180 (mean/median) (see Fig. 10).
A redshift step dz = 0.01 and an extinction E(B - V) = 0.0
were adopted (3206 SEDs in the training
set). A bootstrap on 100 extractions with maximum
number of epochs set to 1000 was carried out. Again,
introducing the effects of extinction does not improve
this result.

Training on the colors and the apparent luminosity
in the F814 band (7 inputs) derived from the
RV00 models produces a dispersion in the HDF-S
sample σ_z^test = 0.158/0.153 (mean/median), better
than the estimates obtained with the CWWK SEDs.
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Table 3. Training of different architectures on the HDF-N spectroscopic sample and a set of templates derived from 
Rocca-Volmerange (redshift in the interval z = 0-6). The evaluation is on the HDF-S spectroscopic sample. The
bootstrap has been computed on 100 extractions (100 members of the committee). In the training data column, "+150" 
means that the 150 spectroscopic redshifts in the HDF-N have been used in addition to the RV00 models. 
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Table 4. Training with various NN architectures on templates derived from CWWK and RV00. The bootstrap has 
been computed on 100 extractions (100 members of the committee). 
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Fig. 11. Comparison between spectroscopic redshift in the 
HDF-S and the neural redshift obtained with a committee 
of networks and using as input pattern the colors. The 
estimation of the redshift for each object is the mean of 100 
predictions and the error bars represent the 1-σ interval. The
training set is composed of RV00 models (112824 SEDs,
see Table 0}. The symbols are the same as in Fig.0] 



Table 5. Summary of the different tests performed on
the HDF-S spectroscopic sample (z < 3.5, 44 objects)
described in Section 5. The dispersion σ_z is calculated in
a low redshift regime z < 2 (34 objects) and high redshift
regime z > 2 (10 objects).

Training set     σ_z (z < 3.5)   σ_z (z < 2)   σ_z (z > 2)
                 44 objs.        34 objs.      10 objs.
HDF-N            0.172           0.186         0.114
HDF-N mag.       0.162           0.139         0.222
CWWK & HDF-N     0.128           0.131         0.114
RV00 & HDF-N     0.118           0.128         0.094
CWWK             0.186           0.146         0.282
RV00             0.153           0.115         0.237
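The three columns of Table 5 can be cross-checked against each other. The sketch below assumes that σ_z is a root-mean-square deviation, so that the subsample values combine in quadrature weighted by the number of objects; the helper `combine_rms` is hypothetical and the exact definition of σ_z used in the paper may differ slightly.

```python
import math

def combine_rms(groups):
    """Combine RMS dispersions of disjoint subsamples.

    groups: iterable of (n_objects, sigma) pairs.
    """
    total = sum(n for n, _ in groups)
    return math.sqrt(sum(n * sigma ** 2 for n, sigma in groups) / total)

# Rows of Table 5: (34 objects at z < 2, 10 objects at z > 2).
hdfn_all = combine_rms([(34, 0.186), (10, 0.114)])   # table quotes 0.172
cwwk_all = combine_rms([(34, 0.146), (10, 0.282)])   # table quotes 0.186
```

Under this assumption the combined values agree with the first column to within rounding for the HDF-N and CWWK rows.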



Fig. 12 compares the prediction of a NN trained
on the RV00 templates with the spectroscopic redshifts
in the HDF-N and HDF-S. The dispersion
turns out to be σ_z^test = 0.231 for the full HDF-N
plus HDF-S sample and 0.259 for the HDF-N only.

In Table 5 the tests on the HDF-S spectroscopic
sample are summarized. The dispersion is calculated
for 44 objects at z < 3.5 and separately in the low-redshift
(z < 2) and high-redshift (z > 2) regimes.
In general the performance improves when the information
in the training set increases.



Vanzella E. et al.: Photometric redshifts with a MLP Neural Network 






[Fig. 12 plot: Committee of MLPs, training on RV00 models. Filled circles: HDF-S z_spec; open circles: HDF-N z_spec. HDF-N: σ_z = 0.259; HDF-N/S: σ_z = 0.231.]



Fig. 12. Comparison between spectroscopic redshifts 
(HDF-N and HDF-S) and the neural redshifts obtained 
with a committee of networks, using as an input pattern 
the colors and the apparent luminosity in the F814 band 
derived from RVOO models. The estimation of the redshift 
for each object is the median of 100 predictions and the 
error bars represent the 1-cr interval. 



6. Application to the SDSS DR1 

The Sloan Digital Sky Survey² (SDSS;
York et al. 2000) consortium has publicly
released 134015 spectroscopic redshifts
(Abazajian et al. 2003). The photometry in the
ugriz bands and various image morphological
parameters are also available.

Recently, Tagliaferri et al. (2002) and
Firth et al. (2002) have used neural networks to produce
photometric redshifts based on the SDSS Early
Data Release (SDSS EDR, Stoughton et al. 2002),
while Ball et al. (2003) have applied neural networks
to the DR1 sample.



2 Funding for the creation and distribution of the SDSS 
Archive has been provided by the Alfred P. Sloan Foundation, 
the Participating Institutions, the National Aeronautics and 
Space Administration, the National Science Foundation, the 
U.S. Department of Energy, the Japanese Monbukagakusho, 
and the Max Planck Society. The SDSS Web site is
http://www.sdss.org/. The Participating Institutions are
The University of Chicago, Fermilab, the Institute for 
Advanced Study, the Japan Participation Group, The Johns 
Hopkins University, the Max-Planck-Institute for Astronomy 
(MPIA), the Max-Planck-Institute for Astrophysics (MPA), 
New Mexico State University, Princeton University, the United 
States Naval Observatory, and the University of Washington. 
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Fig. 13. Redshift distribution of the spectroscopic sample 
obtained from the SDSS DR1 (113000 galaxies, solid line). 
The dashed line represents the distribution of the neu- 
ral redshift prediction of the test sample (88108 galaxies) 
normalized to the total sample obtained with a 19:12:10:1 
architecture (see text). 

We have selected the data with the following criteria
(see also Firth et al. 2002): (1) the spectroscopic
redshift confidence must be greater than 0.95
and there must be no warning flags, (2) r < 17.5.
Moreover we have adopted the photometric criteria
proposed in Yasuda et al. (2001) for the star-galaxy
separation. An object is classified as a star in any
band if the model magnitude and the PSF magnitude
differ by no more than 0.145. The resulting catalog
is almost entirely limited to z < 0.4. The redshift
distribution of the DR1 sample is shown in Fig. 13.
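The selection above can be written as a boolean mask. The argument names (`z_conf`, `z_warning`, `r_mag`, `model_mag`, `psf_mag`) are hypothetical placeholders rather than actual SDSS column names, and rejecting an object that looks star-like in any single band is one possible reading of the criterion.

```python
import numpy as np

def select_galaxies(z_conf, z_warning, r_mag, model_mag, psf_mag):
    """Return a boolean mask of objects passing the cuts.

    model_mag, psf_mag: arrays of shape (n_objects, n_bands).
    """
    z_conf = np.asarray(z_conf, dtype=float)
    z_warning = np.asarray(z_warning)
    r_mag = np.asarray(r_mag, dtype=float)
    diff = np.abs(np.asarray(model_mag, float) - np.asarray(psf_mag, float))

    good_spectrum = (z_conf > 0.95) & (z_warning == 0)   # criterion (1)
    bright = r_mag < 17.5                                 # criterion (2)
    # Star-like if model and PSF magnitudes differ by no more than 0.145
    # in at least one band (assumed combination rule).
    star_like = np.any(diff <= 0.145, axis=1)
    return good_spectrum & bright & ~star_like

# Three toy objects in five bands (u, g, r, i, z).
psf = np.full((3, 5), 18.0)
model = psf - np.array([[0.5] * 5, [0.5] * 5, [0.5, 0.5, 0.1, 0.5, 0.5]])
mask = select_galaxies([0.99, 0.90, 0.99], [0, 0, 0], [17.0, 17.0, 17.0],
                       model, psf)
```

In the toy example the second object fails the confidence cut and the third is rejected as star-like in one band.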

Two different approaches have been explored in 
the NN estimation of the DR1 photometric redshifts: 

1. A 7:12:10:1 network with 3000 epochs and 10 different
trainings, carried out changing the initial
random distributions of weights and the sequence
of the training examples. The "best" distribution
of weights corresponds to the lowest error in the
training sample (in almost all cases coincident
with the last epoch). The 7 input nodes are: the
colors, the r-band magnitude, the Petrosian 50
and 90 per cent r-band flux radii (u-g, g-r,
r-i, i-z, r, PetR50, PetR90).

2. A 19:12:10:1 network with 15000 epochs and a
single training carried out. The additional inputs
are in this case the u-, g-, i-, z-band magnitudes
and the Petrosian 50 and 90 per cent flux radii in
these bands.

[Fig. 14 plot: SDSS DR1, net 19:12:10:1, 15000 epochs. Curves: training set (24892 galaxies) and test set (88108 galaxies). Abscissa: Log10 (number of epochs).]

Fig. 14. Behavior of the prediction as a function of the
epochs for the SDSS DR1 sample. The non-uniform training
sample has been used with the 19:12:10:1 architecture.
3000 epochs have been computed, the training and test errors
are shown as a function of the epoch.

The results in terms of dispersions (σ_z and |Δ_z|) and
mean offsets <Δ_z> are summarized in Table 6.
Increasing the number of input nodes and the number
of epochs improves the result only slightly. In
particular, Fig. 14 shows the behavior of the training
error for the 19:12:10:1 network as a function of the
"current" epoch, shown until the maximum epoch
3000. It is worth noting that, because of the incremental
learning method used in the present work (see
Sect. 2.1.2), each epoch corresponds to a number of
variations of weights equal to the number of training
examples in the training set. This explains why the
predictions of the network are good also at the very
beginning (epoch 1) of the training phase.
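The bookkeeping of incremental learning can be illustrated with a toy model. The sketch below fits a straight line, not an MLP, with one weight update per training example; the function `online_fit`, the learning rate and the synthetic data are assumptions chosen only to show how updates accumulate within an epoch.

```python
import random

def online_fit(xs, ys, learning_rate=0.1, n_epochs=1, seed=0):
    """Fit y = w*x + b with one weight update per training example."""
    rng = random.Random(seed)
    w, b, updates = 0.0, 0.0, 0
    order = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # random sequence of the training examples
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
            updates += 1
    return w, b, updates

xs = [i / 1000 for i in range(1000)]
ys = [0.5 * x + 0.1 for x in xs]
w, b, updates = online_fit(xs, ys, n_epochs=1)

# With 24892 training galaxies, 3000 epochs imply this many weight updates.
updates_sdss = 24892 * 3000
```

A single epoch on the SDSS training set already amounts to 24892 updates, which is why the error can be low at epoch 1.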

[Fig. 15 plot; abscissa: spectroscopic redshift.]

Fig. 15. Redshift prediction in the SDSS DR1 (113000
galaxies) spectroscopic sample using a 19:12:10:1 architecture,
3000 epochs and 19 inputs (u-g, g-r, r-i, i-z, u, g,
r, i, z, PetU50, PetU90, PetG50, PetG90, PetR50, PetR90,
PetI50, PetI90, PetZ50, PetZ90) as input pattern. In the
lower panel (training set on the left, test set on the right)
the training set has been built adopting a grid with a fixed
step dz = 0.000012 and extracting one galaxy for each interval
of the grid (24892 galaxies in total). In the upper
panel (training set on the left, test set on the right) the
training set has been built extracting randomly a sample
of the same size (24892 galaxies) as the uniform sample.
In the left panels only one point every 16 is plotted, while in
the right panels only one point every 50 is plotted.

The highly inhomogeneous distribution of the redshifts
(see Fig. 13) is expected to produce a bias in
the estimates, as discussed in Tagliaferri et al. (2002),
since any network will tend to perform better in
the range where the density of the training points is
higher. To investigate this effect two types of training
have been carried out: on a uniform training set
and on a randomly extracted training set. The random
and the uniform training sets are both made of 24892
galaxies. In the case of the randomly extracted training
set (Fig. 15, upper panels), a trend in the training
and test phase is evident. It appears as a distortion
around z~0.1, corresponding to the higher density of
training points (see Fig. 13). The behavior of the diagram
using a uniform training set is more regular
(Fig. 15, lower panels).
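A uniform training set of the kind described in the caption of Fig. 15 (a fixed-step redshift grid with one galaxy drawn per occupied interval) can be sketched as follows. The function name and the toy redshifts are hypothetical, and drawing the galaxy at random within each interval is an assumption.

```python
import numpy as np

def uniform_training_indices(z_spec, dz, rng):
    """Pick one randomly chosen galaxy in each occupied redshift bin."""
    z_spec = np.asarray(z_spec, dtype=float)
    bins = np.floor(z_spec / dz).astype(np.int64)
    # Shuffle first, so that the "first" member found in each bin is random.
    order = rng.permutation(len(z_spec))
    _, first = np.unique(bins[order], return_index=True)
    return np.sort(order[first])

rng = np.random.default_rng(2)
z_toy = np.array([0.012, 0.017, 0.053, 0.055, 0.058, 0.204])
picked = uniform_training_indices(z_toy, dz=0.01, rng=rng)
picked_bins = np.floor(z_toy[picked] / 0.01).astype(int)
```

Galaxies that are not picked remain available for the test set, so the test sample keeps the original, non-uniform redshift distribution.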

Due to the large amount of data available, the 
trainings with and without the validation set have 
produced indistinguishable results. Also the disper- 
sion obtained with a committee of networks and with 
a single member is comparable, therefore no regular- 
ization has been applied and a single training has 
been adopted in all cases. 

Table 6. SDSS - DR1: Training on 24892 galaxies (uniform and random sample). Test on 88108 galaxies. The mean
values are derived from 10 trainings by varying the initial random distribution of weights and the sequence of the
training examples. In the first 2 rows 7 input nodes have been used (u-g, g-r, r-i, i-z, r, PetR50, PetR90). Rows 3 and
4 correspond to a single training and 19 inputs have been used (u-g, g-r, r-i, i-z, u, g, r, i, z, PetU50, PetU90, PetG50,
PetG90, PetR50, PetR90, PetI50, PetI90, PetZ50, PetZ90).

Increasing the number of connections in the architecture
of the network does not cause the results
to change significantly. It is interesting to note that
even with a simple network 7:2:5:1 (34 weights and
7 input neurons), the dispersion obtained is comparable
to that of the 381-weight net (19:12:10:1). The 7:2:5:1
gives σ_z ~ 0.027 (|Δ_z| ~ 0.021) in the 88108 test
galaxies sample.
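The size of a fully connected architecture can be counted with a few lines of code. The sketch assumes one weight per connection plus one bias per non-input unit; the helper `count_parameters` is hypothetical. Under that assumption the 19:12:10:1 network has 381 parameters, matching the figure quoted in the text, while the same convention gives 37 for 7:2:5:1, so the quoted 34 presumably follows a slightly different counting.

```python
def count_parameters(layers, with_bias=True):
    """Number of free parameters of a fully connected feed-forward net.

    layers: sizes from input to output, e.g. [19, 12, 10, 1].
    """
    total = 0
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        total += n_in * n_out        # connection weights
        if with_bias:
            total += n_out           # one bias per receiving unit
    return total

large_net = count_parameters([19, 12, 10, 1])
small_net = count_parameters([7, 2, 5, 1])
small_net_no_bias = count_parameters([7, 2, 5, 1], with_bias=False)
```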

Various photometric redshift techniques 
(template-fitting, Bayesian method, polynomial 
fitting, nearest-neighbor etc.) have been applied to 
a similar spectroscopic sample extracted from the 
SDSS EDR (see 



u et al. 2002|) . They produce 
in general significantly worse results in terms of
redshift dispersion, except for the "Kd-tree", which
shows σ_z = 0.025.






7. Summary and conclusions 

We have presented a new technique for the estimation
of redshifts based on feed-forward neural networks.
The neural architecture has been tested on a spec- 
troscopic sample in the HDF-S (44 objects) in the 
range 0.1 < z < 3.5 and on a large sample (113000 
galaxies) derived from the SDSS DR1. 

The flexibility offered by NNs allows us to train 
the networks on sets that are homogeneous (i.e. on 
spectroscopic redshifts or simulated templates) or 
mixed (e.g. on spectroscopic redshifts and simulated 
data). The galaxy templates for the training of the 
NNs with simulated data have been derived from ob- 
servational samples (the CWWK SEDs) and from 
theoretical data (Pegase models). 

The training on the theoretical data (colors and I
mag. as input pattern) produces a σ_z^test in the HDF-S
of the order of 0.15 (RV00), while the training on the
HDF-N spectroscopic sample produces σ_z^test ~ 0.18
(colors as input pattern) and σ_z^test ~ 0.15 (colors and
apparent I luminosity as input pattern). The training
on mixed samples (observed SEDs with spectroscopic
redshift (HDF-N) and theoretical SEDs (CWWK or
RV00 models)) improves the prediction, and a dispersion
of the order of σ_z^test ~ 0.11 is reached.



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Fig. 16. Comparison between spectroscopic redshift in the 
HDF-S and the neural redshift obtained with a committee
of networks and using as input pattern the colors.
The estimation of the redshift for each object is the
mean of 100 predictions and the error bars represent the
1-σ interval. In the left panels, the training set is composed
of 150 (HDF-N) and 44 (HDF-S) spectroscopic
redshifts (open circles). The evaluation has been done
on the recent sample of spectroscopic redshifts (z < 1)
provided by Sawicki et al. (2003), filled circles, on
the large spiral galaxy at z = 1.439, filled square symbol
(Labbé et al. 2003), and on the galaxy at z = 1.248, open
square symbol (Trujillo et al. 2003). In the right panels
only the evaluation symbols are shown.



At the end of the training the NN contains "ex- 
perience" that is a combination of the observed data 
and the models. 
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It is interesting to note that the spectroscopic 
sample in the HDF-S can be used either as part
of the training set or as a validation set in order
to calibrate and tune the prediction (at least
for the brighter objects) and that with the increasing
availability of spectroscopic redshifts the prediction
can be continually improved. As an example
we have used both the HDF-N and the HDF-S
spectroscopic samples (194 objects in total) to predict
with a 6:20:20:1 architecture the redshifts of 33
galaxies in the range 0.1 < z < 1.5 recently published
by Sawicki et al. (2003), Labbé et al. (2003) and
Trujillo et al. (2003). The resulting dispersion turns
out to be σ_z = 0.066 (Fig. 16).

A reference dataset estimating photometric redshifts
in the HDF-S down to I_AB = 27 has been produced:
the training has been performed on a set composed
of RV00 models, 150 spectroscopic redshifts in
the HDF-N and 77 spectroscopic redshifts in HDF-S.

The better generalization obtained using a com- 
mittee of networks with respect to a single network is 
more evident in the case of small training sets (Sec. 
5.1 and 5.2). If the training set is sufficiently com- 
plete and representative, good generalization can be 
achieved also with a single training. 

In summary the NN approach introduces the fol- 
lowing advantages: 

1. Rapidity in the evaluation phase with respect to
more conventional techniques and possibility to
deal with very large datasets. The redshifts of 10^5
galaxies can be estimated in a few seconds (using a
laptop with PIII, 1.1 GHz).

2. The system can quickly learn new information, for
example when new spectroscopic redshifts become
available.

3. A priori knowledge (such as morphological properties,
apparent luminosity, etc.) can be taken into
account.

4. There are no assumptions concerning the distribution
of the input variables.

5. Feed-forward NNs can also be implemented via
hardware, in the so-called machine learning
scheme. Neural processors have the same generalization
and learning ability as the MLP simulated
via software (Battiti & Tecchiolli 1995), but with
an extremely high velocity performance (10^6-10^7
galaxies per second, a very useful feature in the
training phase).
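The speed of the evaluation phase follows from the fact that a trained MLP reduces to a few matrix products. The sketch below evaluates a 6:20:20:1 network on 10^5 input patterns in one vectorized pass; the random weights, the sigmoid hidden units and the linear output unit are assumptions standing in for a trained network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(inputs, weights, biases):
    """Forward pass: sigmoid hidden layers, linear output (assumed)."""
    activation = inputs
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = sigmoid(activation @ w + b)
    return activation @ weights[-1] + biases[-1]

rng = np.random.default_rng(3)
layers = [6, 20, 20, 1]
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layers[:-1], layers[1:])]
biases = [rng.normal(size=n_out) for n_out in layers[1:]]

colors = rng.normal(size=(100_000, 6))   # stand-in for 10^5 galaxies
z_phot = mlp_forward(colors, weights, biases).ravel()
```

For a committee, the same pass is repeated once per member and the outputs are combined with the mean or median.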

Future developments include a better treatment of 
photometric errors and upper limits, and the recog- 
nition of characteristics of the galaxies (e.g. the type) 
from the input colors and/or morphological features 
(such as the Sersic index, luminosity profiles, etc.). 
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